Skip to content

Conversation

@tsivaprasad
Copy link
Contributor

@tsivaprasad tsivaprasad commented Dec 3, 2025

Summary

This PR implements automatic etcd mode reconfiguration of Control Plane host's etcd mode between server and client after initialization.

Changes

Automatic Reconfiguration

  • Mode Detection: Automatically detects when PGEDGE_ETCD_MODE environment variable changes
  • Server → Client: Safely demotes etcd server to client mode, removing from cluster membership
  • Client → Server: Promotes client to server mode, joining the etcd cluster with proper credentials

Testing

Manually validated:

  • Promoting a client host to server mode
1. Created a cluster with 6 nodes (3 etcd servers, 3 etcd clients) and verified that all nodes come up healthy. verified that the servers are listed correctly in the etcd cluster.
+------------------+---------+--------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |  NAME  |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------+---------------------------+---------------------------+------------+
|  7322dc2897edcd9 | started | host-3 | https://192.168.64.2:2580 | https://192.168.64.2:2579 |      false |
| 6c021e21fc66bdea | started | host-2 | https://192.168.64.2:2480 | https://192.168.64.2:2479 |      false |
| b71f75320dc06a6c | started | host-1 | https://192.168.64.2:2380 | https://192.168.64.2:2379 |      false |
+------------------+---------+--------+---------------------------+---------------------------+------------+
2. Modified the compose file with host-4 as server with different ports
3. using command WORKSPACE_DIR=/Users/sivat/projects/control-plane/control-plane docker compose -f ./docker/control-plane-dev/docker-compose.yaml up -d host-4
4. Verified that host-4 comes up healthy and joins the etcd cluster
+------------------+---------+--------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |  NAME  |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------+---------------------------+---------------------------+------------+
|  7322dc2897edcd9 | started | host-3 | https://192.168.64.2:2580 | https://192.168.64.2:2579 |      false |
| 3a97d6e09511ba61 | started | host-4 | https://192.168.64.2:2680 | https://192.168.64.2:2679 |      false |
| 6c021e21fc66bdea | started | host-2 | https://192.168.64.2:2480 | https://192.168.64.2:2479 |      false |
| b71f75320dc06a6c | started | host-1 | https://192.168.64.2:2380 | https://192.168.64.2:2379 |      false |
+------------------+---------+--------+---------------------------+---------------------------+------------+

Respective Logs:

host-4-1 has been recreated
host-4-1  | 1:22PM INF got shutdown signal
host-4-1  | 1:22PM INF attempting to gracefully shut down
host-4-1  | 1:22PM INF shutting down scheduler service component=scheduler_service
host-4-1  | 1:22PM INF closing scheduled job watch component=scheduler_service
host-4-1  | 1:22PM INF shutting down instance monitors
host-4-1 exited with code 0
host-4-1  | 1:22PM INF checking etcd mode for reconfiguration modes_equal=false new_mode=server old_mode=client old_mode_empty=false
host-4-1  | 1:22PM INF etcd started as learner
host-4-1  | 1:22PM INF attempting to promote from learner to voting cluster member
host-4-1  | 1:22PM INF promotion successful
host-4-1  | 1:22PM INF waiting for cluster to be healthy
host-4-1  | 1:22PM INF starting scheduler service component=scheduler_service
host-4-1  | 1:22PM INF starting http server component=api_server host_port=0.0.0.0:3003
  • Demoting a server host to client mode
1. Rolled back the changes by bringing down host-4 and modifying the compose file to set host-4 as client
2. using command WORKSPACE_DIR=/Users/sivat/projects/control-plane/control-plane docker compose -f ./docker/control-plane-dev/docker-compose.yaml up -d host-4
3. Verified that host-4 comes up healthy and its in client mode
+------------------+---------+--------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |  NAME  |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------+---------------------------+---------------------------+------------+
|  7322dc2897edcd9 | started | host-3 | https://192.168.64.2:2580 | https://192.168.64.2:2579 |      false |
| 6c021e21fc66bdea | started | host-2 | https://192.168.64.2:2480 | https://192.168.64.2:2479 |      false |
| b71f75320dc06a6c | started | host-1 | https://192.168.64.2:2380 | https://192.168.64.2:2379 |      false |
+------------------+---------+--------+---------------------------+---------------------------+------------+

Respective Logs:

host-4-1 has been recreated
host-4-1  | 1:24PM INF got shutdown signal
host-4-1  | 1:24PM INF attempting to gracefully shut down
host-4-1  | 1:24PM INF shutting down scheduler service component=scheduler_service
host-4-1  | 1:24PM INF closing scheduled job watch component=scheduler_service
host-4-1  | 1:24PM INF shutting down instance monitors
host-4-1 exited with code 0
host-4-1  | 1:24PM INF checking etcd mode for reconfiguration modes_equal=false new_mode=client old_mode=server old_mode_empty=false
host-4-1  | 1:24PM INF starting http server component=api_server host_port=0.0.0.0:3003

PLAT-315

Summary by CodeRabbit

  • New Features

    • Enabled automatic etcd mode reconfiguration between embedded server and remote client modes with seamless transitions between configurations.
  • Improvements

    • Made etcd startup idempotent to prevent errors when already initialized.

✏️ Tip: You can customize this high-level summary in your review settings.

@tsivaprasad tsivaprasad requested a review from Copilot December 3, 2025 17:47
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds comprehensive documentation for reconfiguring etcd mode on Control Plane hosts, enabling operators to safely change hosts between server and client modes after initial deployment. The documentation includes detailed procedures, troubleshooting guidance, and best practices based on manual testing in a development environment.

Key Changes

  • Added complete step-by-step procedures for promoting clients to servers and demoting servers to clients
  • Documented environment variable configuration requirements and container restart workflows
  • Fixed a bug in rbac.go to ensure EtcdMode is persisted when writing host credentials

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
docs/using/etcd-reconfiguration.md New documentation file covering etcd mode reconfiguration procedures, prerequisites, troubleshooting, and best practices
server/internal/etcd/rbac.go Added missing EtcdMode assignment to ensure mode is preserved in generated config

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@tsivaprasad tsivaprasad marked this pull request as draft December 4, 2025 14:44
@tsivaprasad tsivaprasad force-pushed the PLAT-315-ability-to-reconfigure-a-hosts-etcd-configuration branch 4 times, most recently from 6b6efb0 to 4024167 Compare January 1, 2026 17:25
@tsivaprasad tsivaprasad marked this pull request as ready for review January 1, 2026 17:54
@tsivaprasad tsivaprasad requested a review from Copilot January 5, 2026 11:44
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@coderabbitai
Copy link

coderabbitai bot commented Jan 9, 2026

📝 Walkthrough

Walkthrough

Introduces automatic etcd mode transition capability through the ChangeMode method, enabling seamless switching between server and client configurations. Implements idempotent start, refactors certificate handling with a new helper function, and extends configuration management to coordinate bidirectional mode transitions with cluster operations.

Changes

Cohort / File(s) Summary
Core mode transition system
server/internal/etcd/interface.go, server/internal/etcd/embedded.go, server/internal/etcd/remote.go, server/internal/etcd/provide.go
Adds ChangeMode interface method; implements mode transitions (Server↔Client) with cluster member updates, configuration persistence, and instance lifecycle management; introduces newEtcdForMode helper and reconfiguration logic with 60-second timeouts for transitions
Credentials and RBAC
server/internal/etcd/rbac.go
Extracts etcd server certificate creation into addEtcdServerCredentials helper function; propagates EtcdMode, EtcdClient, and EtcdServer settings to generated configuration in writeHostCredentials
Changelog entry
changes/unreleased/Added-20260109-223536.yaml
Documents feature: "Enabled automatic etcd client ↔ server mode reconfiguration"

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant ProvideEtcd as provideEtcd
    participant EmbeddedEtcd as EmbeddedEtcd
    participant RemoteEtcd as RemoteEtcd
    participant Cluster as Etcd Cluster

    App->>ProvideEtcd: Provision Etcd (mode change detected)
    Note over ProvideEtcd: oldMode=Server, newMode=Client
    ProvideEtcd->>EmbeddedEtcd: ChangeMode(ctx, Client)
    EmbeddedEtcd->>EmbeddedEtcd: Start instance
    EmbeddedEtcd->>Cluster: List cluster members
    EmbeddedEtcd->>EmbeddedEtcd: Update config with client endpoints
    EmbeddedEtcd->>EmbeddedEtcd: Persist configuration
    EmbeddedEtcd->>EmbeddedEtcd: Shutdown embedded etcd
    EmbeddedEtcd->>RemoteEtcd: Create NewRemoteEtcd
    RemoteEtcd->>RemoteEtcd: Start remote client
    EmbeddedEtcd->>Cluster: Remove self from cluster
    EmbeddedEtcd->>EmbeddedEtcd: Remove data directory
    EmbeddedEtcd->>EmbeddedEtcd: Update config to Client mode
    EmbeddedEtcd->>ProvideEtcd: Return RemoteEtcd instance
    ProvideEtcd->>App: Etcd provider ready (Client mode)
Loading
sequenceDiagram
    participant App as Application
    participant ProvideEtcd as provideEtcd
    participant RemoteEtcd as RemoteEtcd
    participant EmbeddedEtcd as EmbeddedEtcd
    participant Cluster as Etcd Cluster

    App->>ProvideEtcd: Provision Etcd (mode change detected)
    Note over ProvideEtcd: oldMode=Client, newMode=Server
    ProvideEtcd->>RemoteEtcd: ChangeMode(ctx, Server)
    RemoteEtcd->>RemoteEtcd: Ensure started
    RemoteEtcd->>RemoteEtcd: Get client principal & credentials
    RemoteEtcd->>RemoteEtcd: Persist server credentials
    RemoteEtcd->>Cluster: Get cluster leader
    RemoteEtcd->>RemoteEtcd: Shutdown remote client
    RemoteEtcd->>EmbeddedEtcd: Create NewEmbeddedEtcd
    EmbeddedEtcd->>Cluster: Join cluster via leader
    EmbeddedEtcd->>EmbeddedEtcd: Update config to Server mode
    EmbeddedEtcd->>EmbeddedEtcd: Persist configuration
    EmbeddedEtcd->>ProvideEtcd: Return EmbeddedEtcd instance
    ProvideEtcd->>App: Etcd provider ready (Server mode)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 Hops with glee through cluster halls,
Server switches dance to client calls,
Mode transitions smooth and clean,
The finest etcd shift you've seen!
ChangeMode magic, hop hooray! 🎉

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Title check ⚠️ Warning The PR title references adding documentation for etcd mode reconfiguration, but the actual changes are code implementation (new ChangeMode methods, refactored etcd provision logic, and YAML changelog entry) rather than documentation. Update the title to reflect the code changes, e.g., 'feat: implement automatic etcd mode reconfiguration (server ↔ client)' or 'feat: enable etcd mode transitions between server and client'.
Docstring Coverage ⚠️ Warning Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Description check ✅ Passed The description covers the feature summary, manual testing with clear examples, and issue linking, but lacks unit/e2e test information and does not explicitly state that documentation was added as mentioned in the objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0e26eed and 4ba07dd.

📒 Files selected for processing (1)
  • changes/unreleased/Added-20260109-223536.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (1)
changes/unreleased/Added-20260109-223536.yaml (1)

1-3: LGTM!

The changelog entry accurately documents the new etcd reconfiguration feature. The YAML structure is valid, and the description is concise and aligned with the PR objectives.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (9)
docs/using/etcd-reconfiguration.md (6)

44-44: Clarify "no configuration needed" — updating the environment variable is still a configuration change.

Line 44 states "No manual API calls or configuration needed!" but this slightly overstates the automation. Users must still update PGEDGE_ETCD_MODE in their environment. A more accurate phrasing would emphasize that the cluster operations (join, membership updates, credentials) are automatic, not that configuration is unnecessary.

Suggested revision: "No manual etcd API calls or cluster member operations needed — the Control Plane handles everything automatically!"


106-107: Expand RBAC troubleshooting guidance.

The "Permission denied" remedy mentions checking RBAC is enabled but doesn't explain how to verify this or what corrective steps to take. Consider adding:

  • How to check if RBAC is enabled on the cluster
  • Whether RBAC misconfiguration is the primary cause or a secondary check
  • What additional logs to inspect if credential provisioning fails

113-114: Clarify the behavior when membership removal fails during demotion.

The statement "System continues transition even if removal fails" is ambiguous. This suggests the host may transition to client mode while still being a voting member of the cluster — a potentially problematic state. Clarify:

  • Does the host become a client-mode node without cluster membership (desired outcome)?
  • Does the host remain a voting member but run in client mode (problematic)?
  • What does "continues transition" mean operationally?
  • Should users manually remove the member via etcdctl if automatic removal fails?

132-132: Clarify expected etcdctl member list status values.

The statement "All members should show STATUS=started" may be imprecise. The etcdctl member list command output format and status values can vary. Provide:

  • The exact expected output format or example
  • A list of acceptable status values (e.g., started, healthy, etc.)
  • What to do if members show a different status (e.g., unstarted)

1-145: Reduce exclamation marks for a more professional tone.

The documentation uses 7 exclamation marks, which is flagged as excessive by style analysis. For technical documentation, a more neutral and measured tone enhances credibility. Consider removing or replacing exclamation marks, especially in:

  • Line 44: "No manual API calls or configuration needed!"
  • Line 145: Summary closing (if it contains an exclamation)

Replace with more neutral punctuation (periods, em-dashes) or rephrase to emphasize facts rather than enthusiasm.


135-145: Consider adding safety and edge-case guidance.

The documentation focuses on the happy path but omits important safety considerations:

  • Backup recommendations: Should users back up etcd data before reconfiguring? If so, how?
  • Network failures: What happens if a host loses connectivity during the reconfiguration process? Can it recover gracefully?
  • Data handling: When promoting a client to server, what happens to any existing etcd data? Does it start fresh or reuse previous data?
  • Rollback procedures: If reconfiguration fails, how should users recover or revert?

Consider adding a "Safety and Backups" section addressing these scenarios, especially for production environments where these edge cases are critical.

server/internal/etcd/provide.go (1)

68-70: Context created unconditionally but only used in transition cases.

The 60-second timeout context is created for all code paths, but it's only necessary for ChangeMode calls. While not incorrect (the defer ensures cleanup), consider creating the context only within the transition cases to make intent clearer.

server/internal/etcd/embedded.go (1)

521-521: Return statement may use stale error value.

The err variable here is from Line 501 (remote.GetClient()). If that succeeded and all subsequent operations also succeeded, err would still be nil, which is correct. However, for clarity and to avoid confusion with shadowed or stale error variables, consider returning nil explicitly.

♻️ Suggested fix
-	return remote, err
+	return remote, nil
server/internal/host/host.go (1)

45-45: Consider adding port range validation.

While the current implementation is functional, adding validation to ensure HTTPPort falls within the valid range (1-65535) would improve robustness. This could be implemented at the configuration loading stage or in the service layer before persistence.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0f218ca and 76d0185.

📒 Files selected for processing (9)
  • docs/using/etcd-reconfiguration.md
  • server/internal/etcd/embedded.go
  • server/internal/etcd/interface.go
  • server/internal/etcd/provide.go
  • server/internal/etcd/rbac.go
  • server/internal/etcd/remote.go
  • server/internal/host/host.go
  • server/internal/host/host_store.go
  • server/internal/host/service.go
🧰 Additional context used
🧬 Code graph analysis (5)
server/internal/host/service.go (1)
server/internal/config/config.go (1)
  • HTTP (112-121)
server/internal/etcd/rbac.go (2)
server/internal/config/config.go (3)
  • EtcdMode (186-186)
  • EtcdClient (163-166)
  • EtcdServer (143-147)
server/internal/etcd/interface.go (1)
  • HostCredentials (30-38)
server/internal/etcd/provide.go (5)
server/internal/config/config.go (4)
  • EtcdMode (186-186)
  • EtcdModeServer (189-189)
  • EtcdModeClient (190-190)
  • Config (193-215)
server/internal/config/manager.go (1)
  • Manager (15-20)
server/internal/etcd/interface.go (1)
  • Etcd (40-55)
server/internal/etcd/embedded.go (1)
  • NewEmbeddedEtcd (44-50)
server/internal/etcd/remote.go (1)
  • NewRemoteEtcd (34-41)
server/internal/etcd/interface.go (1)
server/internal/config/config.go (1)
  • EtcdMode (186-186)
server/internal/etcd/embedded.go (4)
server/internal/config/config.go (6)
  • EtcdMode (186-186)
  • EtcdModeClient (190-190)
  • EtcdModeServer (189-189)
  • Config (193-215)
  • EtcdClient (163-166)
  • EtcdServer (143-147)
server/internal/etcd/interface.go (1)
  • Etcd (40-55)
server/internal/etcd/remote.go (1)
  • NewRemoteEtcd (34-41)
server/internal/etcd/membership.go (1)
  • RemoveMember (34-52)
🪛 LanguageTool
docs/using/etcd-reconfiguration.md

[style] ~73-~73: Using many exclamation marks might seem excessive (in this case: 7 exclamation marks for a text that’s 2568 characters long)
Context: ...a Server to Client (Example - host-4) !!! warning "Quorum Check" Ensure at lea...

(EN_EXCESSIVE_EXCLAMATION)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (22)
server/internal/etcd/rbac.go (3)

54-56: LGTM — Good refactor to centralize server credential creation.

Extracting the server certificate creation into addEtcdServerCredentials reduces duplication and makes the code more maintainable, especially given this logic is now reused in RemoteEtcd.ChangeMode.


347-350: LGTM — Config propagation aligns with mode reconfiguration flow.

Propagating EtcdMode, EtcdClient, and EtcdServer from app config to generated config ensures configuration consistency during host credential writes.


358-381: LGTM — Helper function is well-structured.

The addEtcdServerCredentials helper correctly creates server certificates and assigns them to the credentials struct. Error handling is appropriate with contextual wrapping.

server/internal/etcd/interface.go (1)

54-54: LGTM — Interface extension is well-designed.

Returning (Etcd, error) is the correct approach since mode transitions switch between EmbeddedEtcd and RemoteEtcd implementations.

server/internal/etcd/provide.go (3)

32-42: LGTM — Clean factory function.

The newEtcdForMode helper provides a clear mapping from mode to implementation with appropriate error handling for invalid modes.


55-66: LGTM — Good diagnostic logging for mode transitions.

Logging both old and new modes with boolean flags helps with debugging reconfiguration issues.


72-100: LGTM — Mode transition logic is comprehensive.

The switch statement correctly handles:

  • Initial setup or no-change case (persisting config if initialized)
  • Server → Client transition via EmbeddedEtcd.ChangeMode
  • Client → Server transition via RemoteEtcd.ChangeMode
  • Explicit error for unsupported transitions
server/internal/etcd/remote.go (5)

254-257: LGTM — Mode validation is correct.

Correctly rejects transitions to any mode other than EtcdModeServer when called on a RemoteEtcd instance.


259-261: Starting remote etcd ensures connectivity before credential operations.

Good defensive approach to ensure the remote client is connected and cert service is initialized before attempting the transition.


270-280: Credentials construction reuses existing client certs.

The credentials are correctly assembled from existing client credentials plus newly-created server credentials via addEtcdServerCredentials.


292-303: Verify cluster rejoin behavior if embedded join fails after shutdown.

After r.Shutdown() closes the remote client, the subsequent embedded.Join() must succeed for the host to remain functional. If join fails, the host would be left in an inconsistent state (remote shut down, embedded not started).

Consider whether a rollback mechanism or additional error handling is needed for production resilience.


305-311: Config update after successful join is appropriate.

Updating the generated config only after successful join ensures configuration consistency.

server/internal/etcd/embedded.go (5)

56-58: LGTM — Idempotent start guard.

Returning early if already started prevents double-initialization issues and supports the mode change flow where Start may be called on an already-running instance.


299-299: LGTM — Nil assignment enables restart capability.

Setting e.etcd = nil after close allows the instance to be started again if needed, which aligns with the idempotent start guard.


451-454: LGTM — Mode validation is correct.

Correctly rejects transitions to any mode other than EtcdModeClient when called on an EmbeddedEtcd instance.


473-484: Good validation — prevents demotion when no other members exist.

Checking for remaining endpoints before proceeding prevents leaving the host in an unrecoverable state.


492-508: Transition order is correct for safety.

The sequence (shutdown embedded → start remote → remove member) ensures the remote client can communicate with remaining cluster members to perform the removal.

server/internal/host/service.go (1)

47-47: LGTM! HTTPPort initialization is consistent with the pattern.

The addition of HTTPPort from the configuration aligns with the other host fields being initialized in this struct literal.

server/internal/host/host.go (3)

45-45: HTTPPort field addition is correctly positioned.

The new field integrates cleanly into the Host struct and maintains consistency with the storage layer.


97-97: Round-trip conversion of HTTPPort is correct.

The fromStorage mapping properly extracts HTTPPort from the stored representation.


146-146: toStorage conversion correctly includes HTTPPort.

The symmetric mapping ensures HTTPPort persists correctly. The conversions in lines 97 and 146 form a proper round-trip.

server/internal/host/host_store.go (1)

24-24: HTTPPort field implementation is correct and properly initialized.

The field follows the established pattern and naming conventions. HTTPPort is always initialized from configuration (s.cfg.HTTP.Port) in service.go and is not exposed in API responses, so downstream API consumers will not encounter zero or missing values. No verification needed.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @docs/using/etcd-reconfiguration.md:
- Around line 30-42: Update the "How It Works" / Etcd mode reconfiguration text
to include expected downtime and user-facing disruption details: add a short
paragraph after the numbered steps that explains whether service connections are
dropped during Stop/Restart, approximate duration of typical transitions (e.g.,
seconds to a few minutes), impact on cluster quorum and write availability while
a node transitions modes, and any recommended operator actions (e.g., schedule
maintenance window, drain connections before changing PGEDGE_ETCD_MODE, monitor
logs/health). Reference PGEDGE_ETCD_MODE and the existing "Client→Server" /
"Server→Client" bullets so readers know the guidance applies to that
reconfiguration flow.
- Around line 54-59: Replace the incorrect service name literal "host-4" with
the actual service name "control-plane-host-4" in the docker-compose/docker
commands shown (e.g., change "docker-compose up -d host-4" and any other
commands like "docker stop host-4" / "docker logs host-4" referenced in the
demotion procedure) so all occurrences (notably the two instances in the
reconfiguration and demotion steps) consistently use "control-plane-host-4".
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 76d0185 and b4c2bb7.

📒 Files selected for processing (1)
  • docs/using/etcd-reconfiguration.md
🧰 Additional context used
🪛 LanguageTool
docs/using/etcd-reconfiguration.md

[style] ~69-~69: Using many exclamation marks might seem excessive (in this case: 6 exclamation marks for a text that’s 1836 characters long)
Context: ...a Server to Client (Example - host-4) !!! warning "Quorum Check" Ensure at lea...

(EN_EXCESSIVE_EXCLAMATION)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis

@tsivaprasad tsivaprasad force-pushed the PLAT-315-ability-to-reconfigure-a-hosts-etcd-configuration branch from b4c2bb7 to 0e26eed Compare January 9, 2026 13:13
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @docs/using/etcd-reconfiguration.md:
- Around line 97-102: The docs incorrectly state that `etcdctl member list`
shows `STATUS=started`; change the example to use `etcdctl endpoint health
--cluster` (replace the `docker exec control-plane-host-1 etcdctl member list`
line with `docker exec control-plane-host-1 etcdctl endpoint health --cluster`)
and update the expected text from "All members should show `STATUS=started`" to
"All endpoints should show `is healthy: successfully committed proposal`" so the
documented command and expected output match etcd's actual output.

In @server/internal/etcd/remote.go:
- Around line 292-303: The code calls r.Shutdown() then attempts to start an
embedded server with NewEmbeddedEtcd(...).Join(...); if Join fails the process
loses its remote client and may be left without any etcd connection. Modify the
flow around r.Shutdown, NewEmbeddedEtcd, and embedded.Join so you either (a)
attempt to join the embedded server before permanently shutting down the remote
client or (b) if Join fails, automatically attempt to recover by restarting the
remote client (or rolling back by re-initializing r) and return a richer error;
include clear context in the returned error (mention r.Shutdown, embedded.Join
and JoinOptions) and log recovery attempts so operators can diagnose and recover
from failures.
🧹 Nitpick comments (1)
server/internal/etcd/provide.go (1)

72-91: Consider limiting config persistence to first initialization only.

When oldMode == newMode, the config update on lines 82-89 writes the same values that are already persisted. This could be optimized to only persist when oldMode == "" (first initialization).

♻️ Suggested optimization
 	switch {
 	case oldMode == "" || oldMode == newMode:
 		etcd, err := newEtcdForMode(newMode, cfg, logger)
 		if err != nil {
 			return nil, err
 		}
 		initialized, err := etcd.IsInitialized()
 		if err != nil {
 			return nil, err
 		}
-		if initialized {
+		if initialized && oldMode == "" {
 			generated.EtcdMode = appCfg.EtcdMode
 			generated.EtcdServer = appCfg.EtcdServer
 			generated.EtcdClient = appCfg.EtcdClient
 			if err := cfg.UpdateGeneratedConfig(generated); err != nil {
 				return nil, fmt.Errorf("failed to persist etcd configuration: %w", err)
 			}
 		}
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b4c2bb7 and 0e26eed.

📒 Files selected for processing (6)
  • docs/using/etcd-reconfiguration.md
  • server/internal/etcd/embedded.go
  • server/internal/etcd/interface.go
  • server/internal/etcd/provide.go
  • server/internal/etcd/rbac.go
  • server/internal/etcd/remote.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • server/internal/etcd/interface.go
  • server/internal/etcd/embedded.go
🧰 Additional context used
🧬 Code graph analysis (1)
server/internal/etcd/rbac.go (2)
server/internal/config/config.go (3)
  • EtcdMode (186-186)
  • EtcdClient (163-166)
  • EtcdServer (143-147)
server/internal/etcd/interface.go (1)
  • HostCredentials (30-38)
🪛 LanguageTool
docs/using/etcd-reconfiguration.md

[style] ~69-~69: Using many exclamation marks might seem excessive (in this case: 6 exclamation marks for a text that’s 1836 characters long)
Context: ...a Server to Client (Example - host-4) !!! warning "Quorum Check" Ensure at lea...

(EN_EXCESSIVE_EXCLAMATION)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (9)
docs/using/etcd-reconfiguration.md (1)

1-115: Documentation is clear and well-structured.

The guide effectively explains the automatic reconfiguration workflow, provides clear procedures with verification steps, and includes important best practices for maintaining cluster quorum. The topology recommendations and warnings are appropriate.

server/internal/etcd/rbac.go (3)

52-57: Clean extraction of server credential creation.

The delegation to addEtcdServerCredentials improves code reuse and aligns with the new ChangeMode flows that also need to create server credentials.


347-350: Propagating etcd configuration to generated config.

This ensures the generated config reflects the current etcd mode and settings, which is essential for detecting mode changes on restart.


358-381: Well-structured credential helper.

The addEtcdServerCredentials function correctly creates the server certificate with appropriate DNS names and IP addresses, and follows the existing error handling patterns.

server/internal/etcd/provide.go (2)

32-42: Clean factory pattern for etcd mode selection.

The newEtcdForMode function provides a clear mapping between configuration modes and their implementations.


92-100: Mode transition logic is correct.

The switch cases correctly instantiate the appropriate etcd implementation based on the old mode (since that's what's currently running) and then invoke ChangeMode to transition to the new mode. The 60-second timeout provides adequate time for cluster operations.

server/internal/etcd/remote.go (3)

254-261: Mode validation and initialization are correct.

The method correctly validates that the only supported transition is from client to server mode, and ensures the remote client is started before proceeding with the mode change.


305-314: Config update correctly clears client settings.

The generated config is properly updated to reflect the new server mode, clearing client endpoints and preserving server settings.


270-276: No action needed—credentials are correctly sourced.

cfg.EtcdUsername and cfg.EtcdPassword are correctly read from cfg.Config(). The Config() method returns a combined configuration that merges user config, generated config, and defaults. When credentials are stored in the generated config (as in rbac.go lines 345–346), UpdateGeneratedConfig() triggers a reload that merges those credentials into the main config. Therefore, credentials set during the initial etcd join are available when ChangeMode calls cfg.Config().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants